Ladislas Nalborczyk
LPC, LNC, CNRS, Aix-Marseille Univ
Slides available at https://github.com/lnalborczyk/intro_phil_stat_2022
A scientific theory can be defined as a set of logical propositions that posits causal relationships between observable phenomena.
These logical propositions are originally abstract and broad (e.g., “every object responds to the force of gravity in the same way”) but lead to concrete and specific predictions that are empirically testable (e.g., “the falling speed of two objects A and B should be the same, all other things being equal”).
The concept of a scientific theory is not unitary, though. As an example, Meehl (1986) lists three kinds of theories:
According to Campbell (1990), the (intuitive) logical argument of science has the following form: if the theory is true, then we should observe B, C, and D; we do observe B, C, and D; therefore, the theory is true.
However, this argument is fallacious: it is an instance of affirming the consequent. The invalidity comes from the existence of the cross-hatched area, that is, from other possible explanations for B, C, and D being observed (figure from Campbell, 1990).
We cannot confirm theories, but maybe we can at least find a way of disproving them? According to Popper, a theory can be considered falsifiable if it could, at least in principle, be shown to be false. But what does it mean for a theory to be false?
Here we should note that the falsifiability of the early Popper concerns the problem of demarcation (i.e., what is science and what is pseudoscience): it defines pseudosciences as made of non-falsifiable theories (i.e., theories that do not allow the possibility of being disproved).
But when it comes to describing how science works (descriptive purposes) or to prescribing how scientific enquiries should be conducted (prescriptive purposes), science is usually not well described by the falsification standard, as Popper himself recognised and argued. In fact, deductive falsification is impossible in nearly every scientific context (McElreath, 2016).
In the next sections, we discuss some of the reasons that prevent (almost) any scientific theory from being strictly falsified (in a logical sense), namely: i) the distinction between theoretical and statistical models, ii) the problem of measurement, iii) the problem of continuous hypotheses, and iv) the Duhem–Quine problem.
A statistical model is a device that connects theories to data. It can be defined as an instantiation of a theory as a set of probabilistic statements (Rouder, Morey, & Wagenmakers, 2016).
Theoretical models and statistical models are usually not equivalent, as many different theoretical models can correspond to the same probabilistic description. Conversely, different probabilistic descriptions can be derived from the same theoretical model. In other words, there is no one-to-one mapping between the two worlds, which renders the induction from the statistical model to the theoretical model quite tricky (figure from McElreath, 2020).
Causal and inferential relations between substantive theory, statistical hypothesis, and observational data (figure from Meehl, 1990).
Yet another problem, as stressed by Paul Meehl, is that while statistical methodology usually deals with the issue of assessing the validity of statistical hypotheses from observations, it does not address, and perhaps cannot address, the issue of assessing the validity of substantive theories from the corroboration or disconfirmation of statistical hypotheses.
The logic of falsification is pretty simple and rests on the power of the modus tollens. This argument (whose exposition, for some reason, usually involves swans) can be presented as follows: if my theory is true, then all swans are white (i.e., no black swan exists); a black swan has been observed; therefore, my theory is false.
This argument is perfectly valid and works well for logical statements (statements that are either true or false). However, the first problem that arises when we try to apply this reasoning to the “real world” is the problem of observation error: observations are prone to error, especially at the boundaries of knowledge (McElreath, 2016).
.pull-left[ According to Einstein, neutrinos cannot travel faster than the speed of light. Thus, any observation of faster-than-light neutrinos would act as a strong falsifier of Einstein’s special relativity. In 2011, however, a large team of respected physicists announced the detection of faster-than-light neutrinos (see the Wikipedia article: https://en.wikipedia.org/wiki/Faster-than-light_neutrino_anomaly).]
.pull-right[]
And they were right to suspect something was wrong with the measurement: A fiber optic cable was attached improperly, and a clock oscillator was ticking too fast…
Another problem arises from a misapplication of deductive syllogistic reasoning (a misapplication of the modus tollens). The problem (the “permanent illusion”, as Gigerenzer, 1993, puts it) is that most scientific hypotheses are not really of the kind “all swans are white” but rather of the form: if my hypothesis is true, then it is highly unlikely to observe a black swan.
Given this hypothesis, what can we conclude if we observe a black swan? Not much. To understand why, let’s first translate it into a more common statement in psychological research (from Cohen, 1994): if the null hypothesis is true, then this result (statistical significance) would be highly unlikely; this result has occurred; therefore, the null hypothesis is highly unlikely.
But because of the probabilistic premise (i.e., the “highly unlikely”) this conclusion is invalid. Why?
Consider the following argument (still from Cohen, 1994, borrowed from Pollard & Richardson, 1987): if a person is an American, then they are probably not a member of Congress; this person is a member of Congress; therefore, they are probably not an American.
This conclusion is not sensible (the argument is invalid), because it fails to consider the alternative to the premise, which is that if this person were not an American, the probability of being a member of Congress would be 0.
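To make the point concrete with Bayes’ rule: although \(p(\text{member of Congress} \mid \text{American})\) is tiny (roughly 535 out of about 330 million people), \(p(\text{member of Congress} \mid \text{not American}) = 0\), so that \(p(\text{American} \mid \text{member of Congress}) = 1\). Ignoring this alternative is exactly what makes the probabilistic modus tollens misleading.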
The Congress argument is formally exactly the same as: if the null hypothesis is true, then this result (statistical significance) would probably not occur; this result has occurred; therefore, the null hypothesis is probably not true.
This argument is just as invalid as the previous one, because i) the premise (the hypothesis) is probabilistic/continuous rather than discrete/logical, and ii) it fails to consider the probability of the alternative. Thus, even without measurement/observation error, this problem would prevent us from applying the modus tollens to our hypothesis, thereby preventing any possibility of strict falsification.
A further problem is known as the Duhem–Quine thesis (also known as the underdetermination problem). In practice, when a substantive theory \(T\) is tested, some hidden assumptions, such as auxiliary theories about the instruments we use, are also put under examination (Meehl, 1978; 1990; 1997).
When we test a theory predicting that “if \(O_{1}\)” (some manipulation), “then \(O_{2}\)” (some observation), what we actually mean is that we should observe this relation provided that all these hidden assumptions (i.e., the auxiliary theories, the instrument theories, the particulars, etc.) are true.
Thus, the logical structure of an empirical test of a theory \(T\) can be described as the following conceptual formula (Meehl, 1978; 1990; 1997):
\[(T \land A_{t} \land C_{p} \land A_{i} \land C_{n}) \to (O_{1} \supset O_{2})\]
where the \(\land\) symbols denote conjunction (“and”), the arrow \(\to\) denotes deduction (“it follows that …”), and the horseshoe \(\supset\) is the material conditional (“if \(O_{1}\), then \(O_{2}\)”). \(A_{t}\) is a conjunction of auxiliary theories, \(C_{p}\) is a ceteris paribus clause (i.e., we assume there is no other factor exerting an appreciable influence that could obfuscate the main effect of interest), \(A_{i}\) is an auxiliary theory regarding instruments, and \(C_{n}\) is a statement about experimentally realised conditions (i.e., we assume that there is no systematic error/noise in the experimental settings).
\[(T \land A_{t} \land C_{p} \land A_{i} \land C_{n}) \to (O_{1} \supset O_{2})\]
However, although the modus tollens is a valid figure of the implicative syllogism for logical statements (e.g., “all swans are white”), the neatness of Popper’s classic falsifiability concept is fuzzed up by acknowledging the actual form of an empirical test. Obtaining falsifying evidence in an empirical test does not falsify the substantive theory \(T\) alone: it falsifies the entire left-hand side of the above statement. In other words, what we have achieved by our laboratory or correlational “falsification” is a falsification of the combined claims \(T \land A_{t} \land C_{p} \land A_{i} \land C_{n}\), which is probably not what we had in mind when we did the experiment (Meehl, 1990).
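In symbols, if we observe \(O_{1}\) but not \(O_{2}\), the modus tollens applied to the conceptual formula above only licenses the conclusion:

\[\neg (T \land A_{t} \land C_{p} \land A_{i} \land C_{n}), \quad \text{that is,} \quad \neg T \lor \neg A_{t} \lor \neg C_{p} \lor \neg A_{i} \lor \neg C_{n}\]

At least one element of the conjunction is false, but the logic alone does not tell us which one.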
To sum up, failing to observe a predicted outcome does not necessarily mean that the theory itself is wrong, but rather that the conjunction of the theory and the underlying assumptions is invalid (Lakatos, 1978; Meehl, 1978; 1990).
Falsification in science is almost always consensual, not logical (McElreath, 2020). A theoretical claim is considered falsified only when multiple lines of converging evidence have been obtained by independent teams of researchers, usually after years or decades of critical discussion. The “falsification” of a theory then appears as a social outcome, produced by the community of scientists, and (almost) never as a deductive falsification.
How can we accumulate evidence in favour of or against a theory? That’s where statistics comes into play. There are several philosophical frameworks for statistical inference, which differ in their assumptions and in their definition of what counts as evidence for or against a theory.
Let’s say we are interested in height differences between women and men…
We are going to simulate t-values computed on samples generated under the assumption of no difference between women and men (the null hypothesis \(\mathcal{H}_{0}\)).
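A minimal sketch of this simulation is shown below (the sample sizes, means, and standard deviations are assumptions made for illustration and may differ from the values used in the original slides):

```r
set.seed(666) # for reproducibility
n <- 100 # assumed sample size per group

# "observed" samples of heights (in cm), with assumed means and SDs
men <- rnorm(n, mean = 178, sd = 10)
women <- rnorm(n, mean = 168, sd = 10)

# t-values computed on samples generated under the null (both groups share the same mean)
t_values <- replicate(
  1e4,
  t.test(x = rnorm(n, mean = 170, sd = 10), y = rnorm(n, mean = 170, sd = 10))$statistic
  )

hist(t_values, breaks = 50, main = "Distribution of t-values under the null hypothesis")
```

Given the observed samples, we can then compute the two-sided critical t-value at the 5% level: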
alpha <- .05 # significance threshold (alpha)
abs(qt(alpha / 2, df = t.test(x = men, y = women)$parameter)) # two-sided critical t-value
## [1] 1.972019
A p-value is simply a tail area (an integral) computed from the distribution of the test statistic under the null hypothesis. It gives the probability of observing the data we observed, or more extreme data, given that the null hypothesis is true (Wagenmakers et al., 2007).
\[p[\mathbf{t}(\mathbf{x}^{\text{rep}} ; \mathcal{H}_{0}) \geq t(x)]\]
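Continuing with the simulated samples sketched above (still an illustration relying on assumed data), this tail area can be computed directly from the t-distribution:

```r
welch <- t.test(x = men, y = women) # Welch two-sample t-test
t_obs <- welch$statistic # observed t-value
df <- welch$parameter # degrees of freedom

# two-sided p-value: probability of a t-value at least as extreme under the null
2 * pt(abs(t_obs), df = df, lower.tail = FALSE)
```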
.pull-left[ According to Fisher, the p-value is thought to measure the strength of evidence against the null hypothesis: the lower the p-value, the stronger the evidence against the null hypothesis. But we know that p-values at best correlate (in a loose sense) with evidence (e.g., see Wagenmakers, 2007).]
--
.pull-right[ Neyman & Pearson used p-values and significance thresholds as a way of controlling error rates in the long run. In this perspective, we do not interpret the p-value itself; we only “classify” results as significant or non-significant. This strict procedure allows keeping error rates at a fixed level (given that the null hypothesis is true, see this blogpost).]
The modus tollens is one of the strongest rules of inference in logic. It works perfectly well in science when we deal with hypotheses of the following form: if \(\mathcal{H}_{0}\) is true, then we should not observe \(x\). We observed \(x\). Therefore, \(\mathcal{H}_{0}\) is false.
BUT, most of the time, we deal with continuous, probabilistic hypotheses…
Fisherian inference (induction) is instead of the form: if \(\mathcal{H}_{0}\) is true, then we should PROBABLY not observe \(x\). We observed \(x\). Therefore, \(\mathcal{H}_{0}\) is PROBABLY false.
However, as we have seen previously, this argument is invalid. The modus tollens does not apply to probabilistic statements (e.g., Pollard & Richardson, 1987; Rouder, Morey, Verhagen, Province, & Wagenmakers, 2016).
Confidence intervals are basically regions of significance. Thus, they have to be interpreted as cautiously as p-values and are subject to the same flaws.
A 95% confidence interval does not mean that there is a 95% probability that the interval contains the population value of the parameter (remember the modus tollens fallacy).
The only correct interpretation is to think about it in terms of coverage proportion (see next slide and this blogpost).
A 95% confidence interval represents a statement about the procedure, not about the parameter. It means that, in the long run, 95% of the confidence intervals we could compute (in exact replications of the experiment) would contain the population value of the parameter. But we cannot say anything about the particular confidence interval we computed in this particular experiment…
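As a small sketch of this coverage interpretation (the population mean, standard deviation, and sample size are assumptions made for illustration):

```r
set.seed(666) # for reproducibility
true_mean <- 170 # assumed population mean (cm)

# for each simulated experiment, check whether the 95% CI contains the true mean
covers <- replicate(1e4, {
  ci <- t.test(rnorm(100, mean = true_mean, sd = 10))$conf.int
  ci[1] < true_mean & true_mean < ci[2]
  })

mean(covers) # long-run coverage proportion, expected to be close to 0.95
```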
Frequentist statistics (e.g., p-values and confidence intervals) make sense under the frequentist interpretation of probability: they refer to long-run frequencies.
P-values are simply tail areas in probability distributions. This means that they are conditional on some distribution. But it also means that computing a p-value is a generic statistical procedure; it is not inextricably tied to the null hypothesis (e.g., see Bayesian p-values).
Confidence intervals are basically regions of significance. Thus, they are prone to the very same limits as p-values.
Instead of testing only one hypothesis (the null hypothesis), Bayes factors allow comparing two hypotheses. For instance, let’s say we are comparing two models:
\[\underbrace{\dfrac{p(\mathcal{H}_{0}|D)}{p(\mathcal{H}_{1}|D)}}_\text{posterior odds} = \underbrace{\dfrac{p(D|\mathcal{H}_{0})}{p(D|\mathcal{H}_{1})}}_\text{Bayes factor} \times \underbrace{\dfrac{p(\mathcal{H}_{0})}{p(\mathcal{H}_{1})}}_\text{prior odds}\]
\[\text{evidence}\ = p(D | \mathcal{H}) = \int p(\theta | \mathcal{H}) p(D | \theta, \mathcal{H}) \text{d}\theta\]
The evidence in favour of a model corresponds to the marginal likelihood of that model. In other words, it is the likelihood averaged over the prior distribution of the model’s parameters (as in the integral above), which makes the Bayes factor a kind of Bayesian likelihood ratio.
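For instance, if both hypotheses are considered equally likely a priori (prior odds of 1) and the data yield a Bayes factor \(\text{BF}_{01} = 5\), then the posterior odds are \(5 \times 1 = 5\): after seeing the data, \(\mathcal{H}_{0}\) is five times more plausible than \(\mathcal{H}_{1}\).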
Let’s say we want to estimate the bias \(\theta\) of a coin. For convenience, we can write our predictions as two Beta-Binomial models:
\[ \begin{align} \mathcal{M_{1}} : y_{i} &\sim \mathrm{Binomial}(n, \theta) \\ \theta &\sim \mathrm{Beta}(6, 10) \\ \end{align} \]
\[ \begin{align} \mathcal{M_{2}} : y_{i} &\sim \mathrm{Binomial}(n, \theta) \\ \theta &\sim \mathrm{Beta}(20, 12) \\ \end{align} \]
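Because the Beta-Binomial marginal likelihood has a closed form, the Bayes factor comparing \(\mathcal{M}_{1}\) and \(\mathcal{M}_{2}\) can be computed directly. In the sketch below, the observed data (say, 7 heads out of 20 flips) are an assumption made for illustration:

```r
# marginal likelihood of a Beta(a, b)-Binomial model for y successes out of n trials
marginal_likelihood <- function(y, n, a, b) {
  choose(n, y) * beta(y + a, n - y + b) / beta(a, b)
  }

y <- 7 # assumed number of heads
n <- 20 # assumed number of coin flips

evidence_m1 <- marginal_likelihood(y, n, a = 6, b = 10) # evidence for M1, Beta(6, 10) prior
evidence_m2 <- marginal_likelihood(y, n, a = 20, b = 12) # evidence for M2, Beta(20, 12) prior
evidence_m1 / evidence_m2 # Bayes factor BF12 (evidence for M1 relative to M2)
```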